Scientific Python antipatterns advent calendar day seventeen

For today, a very common task in scientific programming that can be made much more robust with a bit of effort. As a reminder, I’ll post one tiny example per day with the intention that they should only take a couple of minutes to read.

If you want to read them all but can’t be bothered checking this website each day, sign up for the mailing list:

Sign up for the mailing list

and I’ll send a single email at the end with links to them all.

Processing files in a folder

Imagine we have a folder with a collection of files that we need to process in some way. Listing the contents is straightforward once we figure out the path. To keep the path relatively short, let’s say the folder is in my home directory:

import os

os.listdir('/home/martin/example_files')
['README.md', '.ipynb_checkpoints', 'banana.txt', 'apple.txt']

There are two text files that we want to process, and two other entries that we want to skip. If you’re used to using Windows machines then the path might look weird to you, but don’t worry about it for now - we will come back to that later.

The first thing that most beginners try is this:

for filename in os.listdir('/home/martin/example_files'):
    print('processing ' + filename)
    file = open(filename)
processing README.md
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[7], line 3
      1 for filename in os.listdir('/home/martin/example_files'):
      2     print('processing ' + filename)
----> 3     file = open(filename)

File ~/miniforge3/envs/ml/lib/python3.13/site-packages/IPython/core/interactiveshell.py:343, in _modified_open(file, *args, **kwargs)
    336 if file in {0, 1, 2}:
    337     raise ValueError(
    338         f"IPython won't let you open fd={file} by default "
    339         "as it is likely to crash IPython. If you know what you are doing, "
    340         "you can use builtins' open."
    341     )
--> 343 return io_open(file, *args, **kwargs)

FileNotFoundError: [Errno 2] No such file or directory: 'README.md'

but we immediately run into an error. os.listdir returns bare names rather than complete paths, and when we pass a bare filename to open, Python looks for it in the current working directory, which is not the location of our files.
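To see what is going on, it can help to ask Python how it would resolve one of those bare names. The following is only a minimal sketch: it uses a temporary folder as a stand-in for example_files, and the file name in it is made up for illustration.

```python
import os
import tempfile

# A throwaway folder standing in for example_files (hypothetical contents).
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, 'apple.txt'), 'w') as f:
        f.write('data')

    # listdir gives bare names, with no folder information attached
    names = os.listdir(folder)
    print(names)

    # a bare name is resolved against the current working directory,
    # not against the folder that we listed
    resolved = os.path.abspath(names[0])
    print(resolved)
    print(os.path.dirname(resolved) == os.getcwd())
```

The last line prints True: the resolved path points into the working directory, where no such file exists.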

At this point we realise that we have to construct the path to the file. We might try concatenation:

for filename in os.listdir('/home/martin/example_files'):
    print('processing ' + filename)
    file = open('/home/martin/example_files' + filename)
processing README.md
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[9], line 3
      1 for filename in os.listdir('/home/martin/example_files'):
      2     print('processing ' + filename)
----> 3     file = open('/home/martin/example_files' + filename)

File ~/miniforge3/envs/ml/lib/python3.13/site-packages/IPython/core/interactiveshell.py:343, in _modified_open(file, *args, **kwargs)
    336 if file in {0, 1, 2}:
    337     raise ValueError(
    338         f"IPython won't let you open fd={file} by default "
    339         "as it is likely to crash IPython. If you know what you are doing, "
    340         "you can use builtins' open."
    341     )
--> 343 return io_open(file, *args, **kwargs)

FileNotFoundError: [Errno 2] No such file or directory: '/home/martin/example_filesREADME.md'

This also causes an error, but if we look closely at the path in the error message we will see that we have forgotten the folder separator, which on my Linux system is /. So let’s add it to the end of the path:

for filename in os.listdir('/home/martin/example_files/'):
    print('processing ' + filename)
    file = open('/home/martin/example_files/' + filename)
processing README.md
processing .ipynb_checkpoints
---------------------------------------------------------------------------
IsADirectoryError                         Traceback (most recent call last)
Cell In[10], line 3
      1 for filename in os.listdir('/home/martin/example_files/'):
      2     print('processing ' + filename)
----> 3     file = open('/home/martin/example_files/' + filename)

File ~/miniforge3/envs/ml/lib/python3.13/site-packages/IPython/core/interactiveshell.py:343, in _modified_open(file, *args, **kwargs)
    336 if file in {0, 1, 2}:
    337     raise ValueError(
    338         f"IPython won't let you open fd={file} by default "
    339         "as it is likely to crash IPython. If you know what you are doing, "
    340         "you can use builtins' open."
    341     )
--> 343 return io_open(file, *args, **kwargs)

IsADirectoryError: [Errno 21] Is a directory: '/home/martin/example_files/.ipynb_checkpoints'

The next error arises because there is actually another, hidden folder inside the folder that we want to process, and open can’t open a folder. In this case it’s a folder generated by Jupyter notebook itself. So we need to add some logic to skip it:

for filename in os.listdir('/home/martin/example_files/'):
    if not filename.startswith('.'):
        print('processing ' + filename)
        file = open('/home/martin/example_files/' + filename)
processing README.md
processing banana.txt
processing apple.txt
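Checking for a leading dot skips hidden names, but it doesn’t actually test whether an entry is a folder. If that distinction matters, one option is os.scandir, whose entries can report their own type. This is a sketch rather than a recommendation, again using a temporary folder with made-up contents:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # one regular file and one hidden sub-folder, mimicking the example
    with open(os.path.join(folder, 'apple.txt'), 'w') as f:
        f.write('data')
    os.mkdir(os.path.join(folder, '.ipynb_checkpoints'))

    regular_files = []
    with os.scandir(folder) as entries:
        for entry in entries:
            # is_file() is False for folders, whatever they are called
            if entry.is_file():
                regular_files.append(entry.name)

print(regular_files)
```

Note that the two checks answer different questions: a hidden regular file would pass the is_file() test, and a folder without a leading dot would pass the startswith test.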

Now we nearly have some working code; we just need to add the logic to also skip the .md file:

for filename in os.listdir('/home/martin/example_files/'):
    if not (filename.startswith('.') or filename.endswith('.md')):
        print('processing ' + filename)
        file = open('/home/martin/example_files/' + filename)
processing banana.txt
processing apple.txt

Finally we have the logic that we want. In this case we have only two files, but real datasets will often have hundreds.

There are several problems with this code. Firstly, it’s a dangerous pattern to have the path hard-coded in two places; it makes it very easy to accidentally change one path but forget to change the other:

for filename in os.listdir('/home/martin/example_files2/'):
    if not (filename.startswith('.') or filename.endswith('.md')):
        print('processing ' + filename)
        file = open('/home/martin/example_files/' + filename)
processing strawberry.txt
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[15], line 4
      2 if not (filename.startswith('.') or filename.endswith('.md')):
      3     print('processing ' + filename)
----> 4     file = open('/home/martin/example_files/' + filename)

File ~/miniforge3/envs/ml/lib/python3.13/site-packages/IPython/core/interactiveshell.py:343, in _modified_open(file, *args, **kwargs)
    336 if file in {0, 1, 2}:
    337     raise ValueError(
    338         f"IPython won't let you open fd={file} by default "
    339         "as it is likely to crash IPython. If you know what you are doing, "
    340         "you can use builtins' open."
    341     )
--> 343 return io_open(file, *args, **kwargs)

FileNotFoundError: [Errno 2] No such file or directory: '/home/martin/example_files/strawberry.txt'

which leads to bugs that can be hard to track down. We might fix this by making the path a variable:

folder_path = '/home/martin/example_files/'

for filename in os.listdir(folder_path):
    if not (filename.startswith('.') or filename.endswith('.md')):
        print('processing ' + filename)
        file = open(folder_path + filename)
processing banana.txt
processing apple.txt

which will also make it a bit easier to change. But we can’t guarantee that the eventual user of this code will include the trailing folder separator in the folder path. If they don’t:

folder_path = '/home/martin/example_files'

for filename in os.listdir(folder_path):
    if not (filename.startswith('.') or filename.endswith('.md')):
        print('processing ' + filename)
        file = open(folder_path + filename)
processing banana.txt
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[17], line 6
      4 if not (filename.startswith('.') or filename.endswith('.md')):
      5     print('processing ' + filename)
----> 6     file = open(folder_path + filename)

File ~/miniforge3/envs/ml/lib/python3.13/site-packages/IPython/core/interactiveshell.py:343, in _modified_open(file, *args, **kwargs)
    336 if file in {0, 1, 2}:
    337     raise ValueError(
    338         f"IPython won't let you open fd={file} by default "
    339         "as it is likely to crash IPython. If you know what you are doing, "
    340         "you can use builtins' open."
    341     )
--> 343 return io_open(file, *args, **kwargs)

FileNotFoundError: [Errno 2] No such file or directory: '/home/martin/example_filesbanana.txt'

then the call to os.listdir still works, but the call to open breaks because the folder name and the filename run together. So we might have to add the separator explicitly:

folder_path = '/home/martin/example_files'

for filename in os.listdir(folder_path):
    if not (filename.startswith('.') or filename.endswith('.md')):
        print('processing ' + filename)
        file = open(folder_path + '/' + filename)
processing banana.txt
processing apple.txt

This makes the expression that constructs the file path more complicated and error-prone.
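If we do want to build paths by hand, the standard library’s os.path.join is one way to avoid worrying about the trailing separator. A small sketch, reusing the example folder path from above (nothing here touches the disk, so the folder doesn’t need to exist):

```python
import os

filename = 'banana.txt'
joined = []

# the same folder, written with and without the trailing separator
for folder_path in ['/home/martin/example_files', '/home/martin/example_files/']:
    # os.path.join only inserts a separator if one is needed
    joined.append(os.path.join(folder_path, filename))

print(joined)
print(joined[0] == joined[1])
```

On Linux and macOS both versions of the folder path give the same result, so it no longer matters which one the user supplies.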

Another problem with the code is the complexity of the filename filtering. As a general rule, rather than trying to list all of the possible patterns to skip, it’s more robust to do a positive selection of the files that we do want:

folder_path = '/home/martin/example_files'

for filename in os.listdir(folder_path):
    if filename.endswith('.txt'):
        print('processing ' + filename)
        file = open(folder_path + '/' + filename)
processing banana.txt
processing apple.txt

However, an even better solution is to use glob from the standard library. The glob function (in the module of the same name) takes care of listing the files in our target folder, filtering to give just the ones that we want, and constructing complete paths, all in one go:

import glob

glob.glob('/home/martin/example_files/*.txt')
['/home/martin/example_files/banana.txt',
 '/home/martin/example_files/apple.txt']

giving us a list of complete paths that we can plug straight into open:

for filepath in glob.glob('/home/martin/example_files/*.txt'):
    print('processing ' + filepath)
    file = open(filepath)
processing /home/martin/example_files/banana.txt
processing /home/martin/example_files/apple.txt

As a nice side effect, we get much more useful debugging output that includes the exact path to the files that we will be processing. If you have used the Linux/Mac command line at all then you probably already know the special syntax that glob uses for specifying folder and file names, but if not then it’s easy to learn.
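For anyone who hasn’t met the wildcard syntax before, here is a quick sketch of the three most common pieces. It uses fnmatch from the standard library, which applies the same style of pattern to a plain list of names, so we can try it on the folder listing from the start of this post without needing the files themselves:

```python
import fnmatch

names = ['README.md', '.ipynb_checkpoints', 'banana.txt', 'apple.txt']

# * matches any run of characters
star = fnmatch.filter(names, '*.txt')
# ? matches exactly one character
question = fnmatch.filter(names, '?pple.txt')
# [ab] matches exactly one character from the set
bracket = fnmatch.filter(names, '[ab]*.txt')

print(star)
print(question)
print(bracket)
```

One difference to be aware of: with glob, a pattern like * skips names that start with a dot unless the pattern itself starts with a dot, which is why the hidden .ipynb_checkpoints folder never bothers us there.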

Once we have this pattern set up it’s easy to use it to do more complicated things, like list files in multiple folders:

glob.glob('/home/martin/example_files*/*.txt')
['/home/martin/example_files/banana.txt',
 '/home/martin/example_files/apple.txt',
 '/home/martin/example_files2/strawberry.txt']
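One caveat: the order of the paths returned by glob depends on the file system (notice that banana.txt came before apple.txt in the output above). If the processing order matters, wrapping the call in sorted gives a reproducible result. A minimal sketch, using a temporary folder with made-up files:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    for name in ['banana.txt', 'apple.txt', 'README.md']:
        with open(os.path.join(folder, name), 'w') as f:
            f.write('data')

    pattern = os.path.join(folder, '*.txt')
    # sorted() puts the paths in alphabetical order regardless of file system
    filepaths = sorted(glob.glob(pattern))
    basenames = [os.path.basename(p) for p in filepaths]

print(basenames)
```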

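Bonus: the pathlib module from the standard library provides path objects with many useful methods for path construction and manipulation, and it takes care of using the right folder separator on each operating system, including Windows.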
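As a rough sketch of what the same loop might look like with pathlib (the temporary folder and its files are stand-ins invented for this example):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # the / operator joins path components with the correct separator
    for name in ['banana.txt', 'apple.txt', 'README.md']:
        (folder / name).write_text('data')

    # Path.glob yields complete Path objects matching the pattern
    filepaths = sorted(folder.glob('*.txt'))
    for filepath in filepaths:
        print('processing', filepath.name)
        # the with block closes each file as soon as we are done with it
        with filepath.open() as file:
            contents = file.read()

    found = [p.name for p in filepaths]
```

Here the listing, the filtering and the path construction all live on the Path object, and we never type a separator ourselves.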
